Statistical Fault Detection for Parallel Applications with AutomaDeD
Authors
Abstract
Today’s largest systems have over 100,000 cores, with million-core systems expected over the next few years. The large component count means that these systems fail frequently and often in very complex ways, making them difficult to use and maintain. While prior work on fault detection and diagnosis has focused on faults that significantly reduce system functionality, the wide variety of failure modes in modern systems makes them likely to fail in complex ways that impair system performance but are difficult to detect and diagnose. This paper presents AutomaDeD, a statistical tool that models the timing behavior of each application task and tracks its behavior to identify any abnormalities. If any are observed, AutomaDeD can immediately detect them and report to the system administrator the task where the problem began. This identification of the fault’s initial manifestation can provide administrators with valuable insight into the fault’s root causes, making it significantly easier and cheaper for them to understand and repair it. Our experimental evaluation shows that AutomaDeD detects a wide range of faults immediately after they occur 80% of the time, with a low false-positive rate. Further, it identifies weaknesses of the current approach that motivate
Similar resources
Application of Thau Observer for Fault Detection of Micro Parallel Plate Capacitor Subjected to Nonlinear Electrostatic Force
This paper investigates fault detection in a micro parallel plate capacitor subjected to nonlinear electrostatic force. To this end, a Thau observer, which is well suited to fault detection in nonlinear systems, is presented, along with the governing nonlinear dynamic equation of the capacitor. Upper and lower thresholds for fault detection have been obtained. The robustness of th...
An approach to fault detection and correction in design of systems using of Turbo codes
We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective, and turbo codes are one such technique. The algorithm-based fault tolerance techniques that, to detect errors, rely on the c...
Reversible Logic Multipliers: Novel Low-cost Parity-Preserving Designs
Reversible logic is one of the new paradigms for power optimization that can be used in place of current circuits. Moreover, fault tolerance, in the form of error detection or error correction, is a vital aspect of current processing systems. In this paper, because multiplication is an important operation in computing systems, some novel reversible multiplier designs are propose...
A new technique for bearing fault detection in the time-frequency domain
This paper presents a new Fast Kurtogram Method in the time-frequency domain that uses novel types of statistical features instead of the kurtosis. A four-class bearing fault detection problem is investigated using various statistical features. The research is conducted in four stages. First, the stability of each feature for each fault mode is investigated. Then, res...
Fault Detection and Classification in Double-Circuit Transmission Line in Presence of TCSC Using Hybrid Intelligent Method
In this paper, an effective method for fault detection and classification in a double-circuit transmission line compensated with TCSC is proposed. The mutual coupling of parallel transmission lines and presence of TCSC affect the frequency content of the input signal of a distance relay and hence fault detection and fault classification face some challenges. One of the most effective methods fo...